SHOWCASING WORK IN PROGRESS

COMPLETED, BUT UNPOLISHED, WORK AVAILABLE HERE

Project Overview

  • Goal: Help CharityML maximize the likelihood of receiving donations
  • How: Construct a model that accurately predicts whether an individual makes more than 50k/yr
  • Data Source: 1994 US Census data, from the UCI Machine Learning Repository

Note: Dataset donated by Ron Kohavi and Barry Becker, from the article "Scaling Up the Accuracy of Naive-Bayes Classifiers: A Decision-Tree Hybrid". Small changes to the dataset have been made, such as removing the 'fnlwgt' feature and records with missing or ill-formatted entries.

In [1]:
import numpy as np                                # Package for numerical computing with Python
import pandas as pd                               # Package to work with data in tabular form and the like
from scipy.stats import skew
from time import time                             # Package to work with time values

from IPython.display import display               # Allows the use of display() for DataFrames
import matplotlib.pyplot as plt                   # Package for plotting
import seaborn as sns                             # Package for plotting, prettier than matplotlib
import visuals as vs                              # Adapted from Udacity
import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots

import statsmodels.api as sm
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.metrics import fbeta_score, accuracy_score, make_scorer
In [2]:
# iPython Notebook formatting
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
# Account for changes made to imported packages
%load_ext autoreload
%autoreload 2
In [3]:
data = pd.read_csv("census.csv")

EDA: Data Dictionary

  • age: continuous.
  • workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
  • education_level: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
  • education-num: continuous.
  • marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
  • occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
  • relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
  • race: Black, White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other.
  • sex: Female, Male.
  • capital-gain: continuous.
  • capital-loss: continuous.
  • hours-per-week: continuous.
  • native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.
In [4]:
data.info(show_counts=True)   # Show information for each column: non-null counts and dtype
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45222 entries, 0 to 45221
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   age              45222 non-null  int64  
 1   workclass        45222 non-null  object 
 2   education_level  45222 non-null  object 
 3   education-num    45222 non-null  float64
 4   marital-status   45222 non-null  object 
 5   occupation       45222 non-null  object 
 6   relationship     45222 non-null  object 
 7   race             45222 non-null  object 
 8   sex              45222 non-null  object 
 9   capital-gain     45222 non-null  float64
 10  capital-loss     45222 non-null  float64
 11  hours-per-week   45222 non-null  float64
 12  native-country   45222 non-null  object 
 13  income           45222 non-null  object 
dtypes: float64(4), int64(1), object(9)
memory usage: 4.8+ MB
In [5]:
data.describe(include='all').T    # Summarize each factor, transpose the summary (personal preference)
Out[5]:
count unique top freq mean std min 25% 50% 75% max
age 45222 NaN NaN NaN 38.5479 13.2179 17 28 37 47 90
workclass 45222 7 Private 33307 NaN NaN NaN NaN NaN NaN NaN
education_level 45222 16 HS-grad 14783 NaN NaN NaN NaN NaN NaN NaN
education-num 45222 NaN NaN NaN 10.1185 2.55288 1 9 10 13 16
marital-status 45222 7 Married-civ-spouse 21055 NaN NaN NaN NaN NaN NaN NaN
occupation 45222 14 Craft-repair 6020 NaN NaN NaN NaN NaN NaN NaN
relationship 45222 6 Husband 18666 NaN NaN NaN NaN NaN NaN NaN
race 45222 5 White 38903 NaN NaN NaN NaN NaN NaN NaN
sex 45222 2 Male 30527 NaN NaN NaN NaN NaN NaN NaN
capital-gain 45222 NaN NaN NaN 1101.43 7506.43 0 0 0 0 99999
capital-loss 45222 NaN NaN NaN 88.5954 404.956 0 0 0 0 4356
hours-per-week 45222 NaN NaN NaN 40.938 12.0075 1 40 40 45 99
native-country 45222 41 United-States 41292 NaN NaN NaN NaN NaN NaN NaN
income 45222 2 <=50K 34014 NaN NaN NaN NaN NaN NaN NaN
In [6]:
n_records = data.shape[0]                                                   # First element of .shape indicates n
n_greater_50k = data[data['income'] == '>50K'].shape[0]                     # n of those with income > 50k
n_at_most_50k = data.where(data['income'] == '<=50K').dropna().shape[0]     # .where method requires dropping of na for this
greater_percent = round((n_greater_50k / n_records)*100,2)                  # Show proportion of > 50k to whole data

data_details = {"Number of observations": n_records,
                "Number of people with income > 50k": n_greater_50k,
                "Number of people with income <= 50k": n_at_most_50k,
                "Percent of people with income > 50k": greater_percent}     # Cache values of analysis

for item in data_details:                                                   # Iterate through the cache
    print("{0}: {1}".format(item, data_details[item]))                      # Print the values
Number of observations: 45222
Number of people with income > 50k: 11208
Number of people with income <= 50k: 34014
Percent of people with income > 50k: 24.78

EDA: Data Distribution

  • Income
  • Age
  • Workclass
  • Education
  • Marital Status
  • Relationship
  • Race
  • Sex
  • Hours-per-Week
  • Pair-wise
In [7]:
fig = px.histogram(data, x="income", nbins=2)
fig.update_layout(height=600, width=750,
                  title_text="Distribution of Income",
                  showlegend=False)
fig.update_yaxes(title_text="Number of Records")
fig.show()
In [8]:
fig = px.histogram(data, x="age", nbins=data['age'].nunique(), color='income', opacity=0.75)
fig.update_layout(height=600, width=750,
                  title_text="Distribution of Age",
                  showlegend=True)
fig.update_yaxes(title_text="Number of Records")
fig.update_xaxes(title_text="Age")
fig.show()
In [9]:
fig = px.histogram(data, x="workclass", color='income', opacity=0.75)
fig.update_layout(height=600, width=750,
                  title_text="Distribution of Workclass",
                  showlegend=True)
fig.update_yaxes(title_text="Number of Records")
fig.update_xaxes(title_text="Classification of Workclass", )
fig.show()
In [10]:
fig = px.histogram(data, x="education_level", color='income', opacity=0.75)
fig.update_layout(height=600, width=750,
                  title_text="Distribution of Education",
                  showlegend=True)
fig.update_yaxes(title_text="Number of Records")
fig.update_xaxes(title_text="Classification of Education", )
fig.show()
In [11]:
fig = px.histogram(data, x="marital-status", color='income', opacity=0.75)
fig.update_layout(height=600, width=750,
                  title_text="Distribution of Marital-Status",
                  showlegend=True)
fig.update_yaxes(title_text="Number of Records")
fig.update_xaxes(title_text="Classification of Marital-Status")
fig.show()
In [12]:
fig = px.histogram(data, x="occupation", color='income', opacity=0.75)
fig.update_layout(height=600, width=750,
                  title_text="Distribution of Occupation",
                  showlegend=True)
fig.update_yaxes(title_text="Number of Records")
fig.update_xaxes(title_text="Classification of Occupation", )
fig.show()
In [13]:
fig = px.histogram(data, x="relationship", color='income', opacity=0.75)
fig.update_layout(height=600, width=750,
                  title_text="Distribution of Relationship",
                  showlegend=True)
fig.update_yaxes(title_text="Number of Records")
fig.update_xaxes(title_text="Classification of Relationship", )
fig.show()
In [14]:
fig = px.histogram(data, x="race", color='income', opacity=0.75)
fig.update_layout(height=600, width=750,
                  title_text="Distribution of Race",
                  showlegend=True)
fig.update_yaxes(title_text="Number of Records")
fig.update_xaxes(title_text="Classification of Race", )
fig.show()
In [15]:
fig = px.histogram(data, x="sex", color='income', opacity=0.75)
fig.update_layout(height=600, width=750,
                  title_text="Distribution of Sex",
                  showlegend=True)
fig.update_yaxes(title_text="Number of Records")
fig.update_xaxes(title_text="Classification of Sex", )
fig.show()
In [16]:
fig = px.histogram(data, x="hours-per-week", color='income', nbins=10, opacity=0.75)
fig.update_layout(height=600, width=750,
                  title_text="Distribution of Hours-per-Week",
                  showlegend=True)
fig.update_yaxes(title_text="Number of Records")
fig.update_xaxes(title_text="Hours Worked in a Week")
fig.show()
In [17]:
sns.set_context("paper", rc={"font.size":16,
                             "axes.titlesize":16,
                             "axes.labelsize":16,
                             "lines.linewidth": 2.5,
                             "legend.fontsize":12})
sns.pairplot(data[['income', 'age', 'education-num', 'hours-per-week']], 
             kind="reg", 
             hue='income', 
             height=4, 
             plot_kws=dict(scatter_kws=dict(s=9)))
plt.show()

Data Engineering/Preprocessing

Before this data can be used for modeling and application to machine learning algorithms, it must be cleaned, formatted, and structured.

Eng: Factor Names

Factor names with special characters, like -, can cause issues, so a cleaning may prove helpful.

In [18]:
name_changes = {x: x.replace("-", "_") for x in data.columns.tolist() if "-" in x}
data = data.rename(columns=name_changes)

Eng: Categorical Transformations

Working with categorical variables often involves mapping strings to numeric values: frequently 0 or 1 for binomial factors, and $\{x_{0}, x_{1}, \ldots, x_{n}\} \mapsto \{0, 1, \ldots, n\}$ for multinomial factors.

These values may be ordinal (i.e. values with relationships that can be compared as a ranking, e.g. worst, better, best), or nominal (i.e. values indicate a state, e.g. blue, green, yellow).

In [19]:
nom_vars = ['income', 'workclass', 'marital_status', 'occupation', 'relationship', 'race', 'sex', 'native_country']
ord_vars = ['education_level']
map_dict = {}
for name in nom_vars:   # Nominal factors get an arbitrary enumerated mapping
    map_dict[name] = {category:number for number,category in enumerate(data[name].unique())}
# map_dict
ed_lev_cat = {' Doctorate': 0,
              ' Prof-school': 1,
              ' Masters': 2,
              ' Bachelors': 3,
              ' Assoc-voc': 4,
              ' Assoc-acdm': 5,
              ' Some-college': 6,
              ' HS-grad': 7,
              ' 12th': 8,
              ' 11th': 9,
              ' 10th': 10,
              ' 9th': 11,
              ' 7th-8th': 12,
              ' 5th-6th': 13,
              ' 1st-4th': 14,
              ' Preschool': 15}

map_dict['education_level'] = ed_lev_cat
for name in map_dict:
    data['numeric_' + name] = data[name].map(map_dict[name])
In [20]:
for name in map_dict.keys():
    if name != 'native_country':
        message = 'Mapping for variable: numeric_{}'.format(name)
        print("=" * len(message))
        print(message)
        map_df = pd.DataFrame.from_dict(map_dict[name], orient='index').reset_index().rename(columns={'index': 'Factor Value', 0: 'Numerical Value'}).sort_values(by=['Numerical Value'])
        display(map_df)
====================================
Mapping for variable: numeric_income
Factor Value Numerical Value
0 <=50K 0
1 >50K 1
=======================================
Mapping for variable: numeric_workclass
Factor Value Numerical Value
0 State-gov 0
1 Self-emp-not-inc 1
2 Private 2
3 Federal-gov 3
4 Local-gov 4
5 Self-emp-inc 5
6 Without-pay 6
============================================
Mapping for variable: numeric_marital_status
Factor Value Numerical Value
0 Never-married 0
1 Married-civ-spouse 1
2 Divorced 2
3 Married-spouse-absent 3
4 Separated 4
5 Married-AF-spouse 5
6 Widowed 6
========================================
Mapping for variable: numeric_occupation
Factor Value Numerical Value
0 Adm-clerical 0
1 Exec-managerial 1
2 Handlers-cleaners 2
3 Prof-specialty 3
4 Other-service 4
5 Sales 5
6 Transport-moving 6
7 Farming-fishing 7
8 Machine-op-inspct 8
9 Tech-support 9
10 Craft-repair 10
11 Protective-serv 11
12 Armed-Forces 12
13 Priv-house-serv 13
==========================================
Mapping for variable: numeric_relationship
Factor Value Numerical Value
0 Not-in-family 0
1 Husband 1
2 Wife 2
3 Own-child 3
4 Unmarried 4
5 Other-relative 5
==================================
Mapping for variable: numeric_race
Factor Value Numerical Value
0 White 0
1 Black 1
2 Asian-Pac-Islander 2
3 Amer-Indian-Eskimo 3
4 Other 4
=================================
Mapping for variable: numeric_sex
Factor Value Numerical Value
0 Male 0
1 Female 1
=============================================
Mapping for variable: numeric_education_level
Factor Value Numerical Value
0 Doctorate 0
1 Prof-school 1
2 Masters 2
3 Bachelors 3
4 Assoc-voc 4
5 Assoc-acdm 5
6 Some-college 6
7 HS-grad 7
8 12th 8
9 11th 9
10 10th 10
11 9th 11
12 7th-8th 12
13 5th-6th 13
14 1st-4th 14
15 Preschool 15

Eng: Data Separation

For training an algorithm, it is useful to separate the label, or dependent variable ($Y$), from the features, or independent variables ($X$).

In [21]:
Y_income = data[['income', 'numeric_income']]
X = data.drop(['income', 'numeric_income'], axis=1)

Skew

The features capital_gain and capital_loss are positively skewed (i.e. have a long tail in the positive direction).

To reduce this skew, a logarithmic transformation, $\tilde x = \ln\left(x\right)$, can be applied. This transformation will reduce the amount of variance and pull the mean closer to the center of the distribution.

Why does this matter: The extreme points may affect the performance of the predictive model.

Why care: We want an easily discernible relationship between the independent and dependent variables; the skew makes that more complicated.

Why DOESN'T this matter: Most models make no assumption about the distribution of the independent variables. Linear regression, for instance, instead assumes a zero conditional mean of the errors, $E\left(u \mid x\right) = 0$ where $u = Y - \hat{Y}$, along with homoskedasticity of the residuals. In this analysis, the dependent variable is categorical (i.e. discrete, non-continuous), so linear regression is not an appropriate model in any case.

In [22]:
cap_loss = X['capital_loss']
cap_gain = X['capital_gain']
cap_loss_skew, cap_loss_var, cap_loss_mean = skew(cap_loss), np.var(cap_loss), np.mean(cap_loss)
cap_gain_skew, cap_gain_var, cap_gain_mean = skew(cap_gain), np.var(cap_gain), np.mean(cap_gain)
fac_df = pd.DataFrame({'Feature': ['Capital Loss', 'Capital Gain'],
              'Skewness': [cap_loss_skew, cap_gain_skew],
              'Mean': [cap_loss_mean, cap_gain_mean],
              'Variance': [cap_loss_var, cap_gain_var]})
display(fac_df)
Feature Skewness Mean Variance
0 Capital Loss 4.516154 88.595418 1.639858e+05
1 Capital Gain 11.788611 1101.430344 5.634525e+07
In [23]:
fig = make_subplots(rows=2, cols=1)
fig.update_layout(height=800, width=850,
                  title_text="Skewed Distributions of Continuous Census Data Features",
                  showlegend=False
                 )
fig.add_trace(
    go.Histogram(x=X['capital_loss'], nbinsx=25,
    name='Capital-Loss'), 
    row=1, col=1
)
fig.add_trace(
    go.Histogram(x=X['capital_gain'], nbinsx=25,
    name='Capital-Gain'),
    row=2, col=1
)
fig.update_xaxes(title_text="Capital-Loss Feature Distribution", row=1, col=1)
fig.update_xaxes(title_text="Capital-Gain Feature Distribution", row=2, col=1)
for i in range(1,3):   # This figure has two subplot rows
    fig.update_yaxes(title_text="Number of Records", range=[0, 2000],
                     patch = dict(
                         tickmode = 'array',
                         tickvals = [0, 500, 1000, 1500, 2000],
                         ticktext = [0, 500, 1000, 1500, ">2000"]),
                     row=i, col=1)
fig.show()

Eng: Apply Logarithmic Transformation

Again, to reduce this skew, a logarithmic transformation, $\tilde x = \ln\left(x\right)$, can be applied. This transformation will reduce the amount of variance and pull the mean closer to the center of the distribution.

In [24]:
skewed = ['capital_gain', 'capital_loss']
X_log_transformed = pd.DataFrame(data=X).copy()
X_log_transformed[skewed] = X[skewed].apply(lambda x : np.log(x + 1))
In [25]:
fig = make_subplots(rows=2, cols=1)
fig.update_layout(height=800, width=850,
                  title_text="Skewed Distributions of Continuous Census Data Features",
                  showlegend=False
                 )
fig.add_trace(
    go.Histogram(x=X_log_transformed['capital_loss'], nbinsx=25,
    name='Log of Capital-Loss'),
    row=1, col=1

)
fig.add_trace(
    go.Histogram(x=X_log_transformed['capital_gain'], nbinsx=25,
    name='Log of Capital-Gain'),
    row=2, col=1
)
fig.update_xaxes(title_text="Log of Capital-Loss Feature Distribution", row=1, col=1)
fig.update_xaxes(title_text="Log of Capital-Gain Feature Distribution", row=2, col=1)
for i in range(1,3):
    fig.update_yaxes(title_text="Number of Records", range=[0, 2000],
                     patch = dict(
                         tickmode = 'array',
                         tickvals = [0, 500, 1000, 1500, 2000],
                         ticktext = [0, 500, 1000, 1500, ">2000"]),
                     row=i, col=1)
fig.show()
In [26]:
log_cap_loss_skew = skew(X_log_transformed['capital_loss'])
log_cap_loss_var = round(np.var(X_log_transformed['capital_loss']),5)
log_cap_loss_mean = np.mean(X_log_transformed['capital_loss'])
log_cap_gain_skew = skew(X_log_transformed['capital_gain'])
log_cap_gain_var = round(float(np.var(X_log_transformed['capital_gain'])),5)
log_cap_gain_mean = np.mean(X_log_transformed['capital_gain'])
log_fac_df = pd.DataFrame({'Feature': ['Log Capital Loss', 'Log Capital Gain'],
              'Skewness': [log_cap_loss_skew, log_cap_gain_skew],
              'Mean': [log_cap_loss_mean, log_cap_gain_mean],
              'Variance': [log_cap_loss_var, log_cap_gain_var]})
fac_df = pd.concat([fac_df, log_fac_df], ignore_index=True)
fac_df['Variance'] = fac_df['Variance'].apply(lambda x: '%.5f' % x)
display(fac_df)
Feature Skewness Mean Variance
0 Capital Loss 4.516154 88.595418 163985.81018
1 Capital Gain 11.788611 1101.430344 56345246.60482
2 Log Capital Loss 4.271053 0.355489 2.54688
3 Log Capital Gain 3.082284 0.740759 6.08362
In [27]:
fig = make_subplots(rows=4, cols=1)
fig.update_layout(height=800, width=850,
                  title_text="Comparison of Distributions of Continuous Census Data Features",
                  showlegend=False
                 )
fig.add_trace(
    go.Histogram(x=X['capital_loss'], nbinsx=25,
    name='Capital-Loss'),
    row=1, col=1
)
fig.add_trace(
    go.Histogram(x=X_log_transformed['capital_loss'], nbinsx=25,
    name='Log of Capital-Loss'),
    row=2, col=1
)
fig.add_trace(
    go.Histogram(x=X['capital_gain'], nbinsx=25,
    name='Capital-Gain'),
    row=3, col=1
)
fig.add_trace(
    go.Histogram(x=X_log_transformed['capital_gain'], nbinsx=25,
    name='Log of Capital-Gain'),
    row=4, col=1
)
fig.update_xaxes(title_text="Capital-Loss Feature Distribution", row=1, col=1)
fig.update_xaxes(title_text="Log of Capital-Loss Feature Distribution", row=2, col=1)
fig.update_xaxes(title_text="Capital-Gain Feature Distribution", row=3, col=1)
fig.update_xaxes(title_text="Log of Capital-Gain Feature Distribution", row=4, col=1)
for i in range(1,5):
    fig.update_yaxes(title_text="Number of Records", range=[0, 2000],
                     patch = dict(
                         tickmode = 'array',
                         tickvals = [0, 500, 1000, 1500, 2000],
                         ticktext = [0, 500, 1000, 1500, ">2000"]),
                     row=i, col=1)
fig.show()

Eng: Impact of Transformation

Originally, the influence of capital_loss on income was statistically significant, but after the logarithmic transformation, it is not.

Here it can be seen that with a change to the skew, the confidence interval now passes through zero whereas before it did not.

This passing through zero is interpreted as the independent variable being statistically indistinguishable from zero influence on the dependent variable.

In [28]:
train_0 = X['capital_loss']
logit_0 = sm.Logit(Y_income['numeric_income'], train_0)
train_1 = X_log_transformed['capital_loss']
logit_1 = sm.Logit(Y_income['numeric_income'], train_1)
# fit the model
result_0 = logit_0.fit(disp=0)
result_1 = logit_1.fit(disp=0)
# Results
print()
print("Original model")
print(result_0.summary2())
print()
print("Transformed model")
print(result_1.summary2())
Original model
                         Results: Logit
=================================================================
Model:              Logit            Pseudo R-squared: -0.238    
Dependent Variable: numeric_income   AIC:              62678.9084
Date:               2020-04-27 23:13 BIC:              62687.6278
No. Observations:   45222            Log-Likelihood:   -31338.   
Df Model:           0                LL-Null:          -25322.   
Df Residuals:       45221            LLR p-value:      nan       
Converged:          1.0000           Scale:            1.0000    
No. Iterations:     3.0000                                       
------------------------------------------------------------------
                  Coef.   Std.Err.    z     P>|z|   [0.025  0.975]
------------------------------------------------------------------
capital_loss      0.0001    0.0000  3.7473  0.0002  0.0000  0.0001
=================================================================


Transformed model
                         Results: Logit
=================================================================
Model:              Logit            Pseudo R-squared: -0.238    
Dependent Variable: numeric_income   AIC:              62690.3061
Date:               2020-04-27 23:13 BIC:              62699.0254
No. Observations:   45222            Log-Likelihood:   -31344.   
Df Model:           0                LL-Null:          -25322.   
Df Residuals:       45221            LLR p-value:      nan       
Converged:          1.0000           Scale:            1.0000    
No. Iterations:     3.0000                                       
------------------------------------------------------------------
                 Coef.   Std.Err.    z     P>|z|    [0.025  0.975]
------------------------------------------------------------------
capital_loss     0.0095    0.0058  1.6419  0.1006  -0.0018  0.0207
=================================================================

The logarithmic transformation reduced the skew and the variance of each factor.

Feature Skewness Mean Variance
Capital Loss 4.516154 88.595418 163985.81018
Capital Gain 11.788611 1101.430344 56345246.60482
Log Capital Loss 4.271053 0.355489 2.54688
Log Capital Gain 3.082284 0.740759 6.08362

Eng: Normalization and Standardization

These two terms, normalization and standardization, are frequently used interchangeably, but they serve two different scaling purposes.

  • Normalization: scale values between 0 and 1
  • Standardization: transform data to follow a normal distribution, i.e. $X \sim N\left(\mu=0,\sigma ^{2}=1\right)$
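The distinction can be illustrated on a toy array. This is a minimal sketch using scikit-learn's MinMaxScaler (used below) and StandardScaler (not used elsewhere in this notebook, shown here only for contrast):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])   # one feature with an outlier

normalized = MinMaxScaler().fit_transform(x)       # squeezed into [0, 1]
standardized = StandardScaler().fit_transform(x)   # mean 0, variance 1

print(normalized.ravel())     # all values now lie between 0 and 1
print(standardized.ravel())   # centered at 0; the outlier remains far from the rest
```

Note that neither scaler changes the shape of the distribution; the outlier stays an outlier in both cases.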

Earlier, capital_gain and capital_loss were transformed logarithmically, reducing their skew, and affecting the model's predictive power (i.e. ability to discern the relationship between the dependent and independent variables).

Another method of influencing the model's predictive power is normalization of the numerical independent variables, after which each feature will be treated equally in the model.

However, after scaling is applied, observing the data in its raw form will no longer have the same meaning as before.

Note the output from scaling. age is no longer 39 but is instead 0.30137. This value is meaningful only in context of the rest of the data and not on its own.

In [29]:
scaler = MinMaxScaler(feature_range=(0, 1)) # default=(0, 1)
numerical = ['age', 'education_num', 'capital_gain', 'capital_loss', 'hours_per_week']
X_log_minmax = pd.DataFrame(data = X_log_transformed).copy()
X_log_minmax[numerical] = scaler.fit_transform(X_log_transformed[numerical])
print("Original Data")
display(X.head(1))
# Show an example of a record with scaling applied
print("=" * 86)
print("Scaled Data")
display(X_log_minmax.head(1))
Original Data
age workclass education_level education_num marital_status occupation relationship race sex capital_gain ... hours_per_week native_country numeric_workclass numeric_marital_status numeric_occupation numeric_relationship numeric_race numeric_sex numeric_native_country numeric_education_level
0 39 State-gov Bachelors 13.0 Never-married Adm-clerical Not-in-family White Male 2174.0 ... 40.0 United-States 0 0 0 0 0 0 0 3

1 rows × 21 columns

======================================================================================
Scaled Data
age workclass education_level education_num marital_status occupation relationship race sex capital_gain ... hours_per_week native_country numeric_workclass numeric_marital_status numeric_occupation numeric_relationship numeric_race numeric_sex numeric_native_country numeric_education_level
0 0.30137 State-gov Bachelors 0.8 Never-married Adm-clerical Not-in-family White Male 0.667492 ... 0.397959 United-States 0 0 0 0 0 0 0 3

1 rows × 21 columns

In [30]:
fig = make_subplots(rows=4, cols=1)

fig.update_layout(height=800, width=850,
                  title_text="Comparison of Distributions of Continuous Census Data Features",
                  showlegend=False
                 )
fig.add_trace(
    go.Histogram(x=X_log_transformed['capital_loss'], nbinsx=25,
    name='Log of Capital-Loss'),
    row=1, col=1
)
fig.add_trace(
    go.Histogram(x=X_log_minmax['capital_loss'], nbinsx=25,
    name='Normalized Capital-Loss'),
    row=2, col=1
)
fig.add_trace(
    go.Histogram(x=X_log_transformed['capital_gain'], nbinsx=25,
    name='Log of Capital-Gain'),
    row=3, col=1
)
fig.add_trace(
    go.Histogram(x=X_log_minmax['capital_gain'], nbinsx=25,
    name='Normalized Capital-Gain'),
    row=4, col=1
)
fig.update_xaxes(title_text="Log of Capital-Loss Feature Distribution", row=1, col=1)
fig.update_xaxes(title_text="Normalized Capital-Loss Feature Distribution", row=2, col=1)
fig.update_xaxes(title_text="Log of Capital-Gain Feature Distribution", row=3, col=1)
fig.update_xaxes(title_text="Normalized Capital-Gain Feature Distribution", row=4, col=1)
for i in range(1,5):
    fig.update_yaxes(title_text="Number of Records", range=[0, 2000],
                     patch = dict(
                         tickmode = 'array',
                         tickvals = [0, 500, 1000, 1500, 2000],
                         ticktext = [0, 500, 1000, 1500, ">2000"]),
                     row=i, col=1)
fig.show()

Eng: One-Hot-Encoding (i.e. indicator variables)

Earlier, I transformed some categorical values into a numeric mapping. Another, perhaps more common, approach is to make dummy variables from the values of those factors. Pandas has a simple function, pd.get_dummies(), that can perform this very quickly.

To note, this will create a new variable for every value a categorical variable takes:

   someFeature                             someFeature_A  someFeature_B  someFeature_C
0  B                                       0              1              0
1  C          --> one-hot encode -->       0              0              1
2  A                                       1              0              0

This means that $p$, the number of features, will grow, potentially substantially.

It is also worth noting that for modeling, it is important that one value of the factor, a "base case", be dropped from the data. The base case is redundant, i.e. it can be inferred perfectly from the other cases, and, more detrimental to our model, keeping it leads to multicollinearity of the terms.

In some models (e.g. logistic regression, linear regression), an assumption of no multicollinearity must hold.
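A toy sketch of why the base case is redundant (the feature and its values here are made up for illustration): without drop_first, the dummy columns for a feature always sum to one across each row, so any one column is a perfect linear combination of the others.

```python
import pandas as pd

toy = pd.DataFrame({'someFeature': ['B', 'C', 'A', 'B']})

full = pd.get_dummies(toy['someFeature'])                      # columns A, B, C
reduced = pd.get_dummies(toy['someFeature'], drop_first=True)  # base case A dropped

# Every row's dummies sum to 1, so A = 1 - B - C: perfect multicollinearity
print((full.sum(axis=1) == 1).all())   # True
print(list(reduced.columns))           # ['B', 'C']
```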

In [31]:
factors = ['age', 'workclass', 'education_level', 'education_num', 'marital_status',
           'occupation', 'relationship', 'race', 'sex', 'capital_gain', 'capital_loss',
           'hours_per_week', 'native_country',]
# Create dummies, dropping the base case
X_trans = pd.get_dummies(X_log_minmax[factors], drop_first=True)
Y = Y_income['numeric_income']
# Print the number of features after one-hot encoding
encoded = list(X_trans.columns)
print("{} total features after one-hot encoding.".format(len(encoded)))
95 total features after one-hot encoding.

Eng: Shuffling and Splitting

After transforming with one-hot-encoding, all categorical variables have been converted into numerical features. Earlier, they were normalized (i.e. scaled between 0 and 1).

Next, for training a machine learning model, it is necessary to split the data into segments. One segment will be used for training the model, the training set, and the other for testing the model, the testing set.

A common method of splitting is to segment based on proportion of the data. An 80:20 split is typical for training:testing.

sklearn has a function that works well for this, sklearn.model_selection.train_test_split. Essentially, it randomly assigns a portion of the data to a training set and the remainder to a testing set.

  • random_state: By setting a seed, option random_state, we can ensure the random splitting is the same for our model. This is necessary for evaluating the effectiveness of the model. Otherwise, we would be training and testing a model with the same proportional split (if we kept that static), but with different observations of the data.

  • test_size: This setting represents the proportion of the data to be held out for testing. Generally, this is the complement ($1 - x$) of the training_size. For example, if test_size is 0.2, the training_size is 0.8.

  • stratify: Preserves the proportion of the label classes in the split data. As an example, let 1 and 0 indicate the positive and negative cases of a label, respectively. It's possible that only positive or only negative classes exist in either the training or testing set (e.g. $\forall y \in Y_{train}, y = 1$). To avoid this worst-case scenario, stratify preserves the ratio of positive to negative classes in both the training and testing sets.

Here the data is split 80:20 with a seed set of 0 and the distribution of the label's classes preserved:

In [32]:
X_train, X_test, y_train, y_test = train_test_split(X_trans, Y, random_state=0, test_size=0.2, stratify=Y)
In [33]:
original_ratio = round(Y_income['numeric_income'].value_counts()[1] / Y_income['numeric_income'].value_counts()[0],2)
train_ratio = round(y_train.value_counts()[1] / y_train.value_counts()[0], 2)
test_ratio = round(y_test.value_counts()[1] / y_test.value_counts()[0], 2)
print('Original ratio of positive-to-negative classes: {}'.format(original_ratio))
print('Training ratio of positive-to-negative classes: {}'.format(train_ratio))
print('Testing ratio of positive-to-negative classes: {}'.format(test_ratio))
Original ratio of positive-to-negative classes: 0.33
Training ratio of positive-to-negative classes: 0.33
Testing ratio of positive-to-negative classes: 0.33
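The effect of stratify can be sanity-checked on a small synthetic label vector; the toy data below is made up for illustration and is not the census data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced labels: 25 positives, 75 negatives (made-up data)
y = np.array([1] * 25 + [0] * 75)
X = np.arange(len(y)).reshape(-1, 1)  # dummy feature column

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, random_state=0, test_size=0.2, stratify=y)

# Stratification preserves the 25% positive rate in both splits
print(y_tr.mean(), y_te.mean())  # 0.25 0.25
```

Without stratify=y, the positive rate in each split can drift from 25%, especially with small samples.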

ML: Metrics

  1. Accuracy
  2. Precision
  3. Recall
  4. F-$\beta$ Score

In terms of income as a predictor of donating, CharityML has stated they are most likely to receive a donation from individuals whose income exceeds 50,000/yr.

CharityML has limited funds to reach out to potential donors, so misclassifying a person as making more than 50,000/yr is COSTLY. It's more important that the model correctly identifies a person making more than 50,000/yr (i.e. a true positive) than that it predicts they do when they don't (i.e. a false positive).

Metrics: Accuracy

Accuracy is a measure of the correctly predicted data points to total amount of data points:

$$Accuracy=\frac{\sum Correctly\ Classified\ Points}{\sum All\ Points}=\frac{\sum True\ Positives + \sum True\ Negatives}{\sum Observations}$$

A Confusion Matrix demonstrates what a true/false positive/negative is:

         Predict 1         Predict 0
True 1   True Positive     False Negative
True 0   False Positive    True Negative

These errors are sometimes referred to as Type 1 and Type 2 errors:

         Predict 1         Predict 0
True 1   True Positive     Type 2 Error
True 0   Type 1 Error      True Negative
  • Type 1: a positive class is predicted for a negative class (false positive)
  • Type 2: a negative class is predicted for a positive class (false negative)
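The confusion-matrix cells above can be counted directly from a toy set of labels and predictions (the values below are made up for illustration):

```python
import numpy as np

# Made-up labels and predictions for illustration
y_true = np.array([1, 1, 1, 0, 0, 0, 1, 0])
y_pred = np.array([1, 0, 1, 1, 0, 0, 1, 0])

tp = np.sum((y_pred == 1) & (y_true == 1))  # true positives
fp = np.sum((y_pred == 1) & (y_true == 0))  # Type 1 error
fn = np.sum((y_pred == 0) & (y_true == 1))  # Type 2 error
tn = np.sum((y_pred == 0) & (y_true == 0))  # true negatives

accuracy = (tp + tn) / len(y_true)
print(tp, fp, fn, tn, accuracy)  # 3 1 1 3 0.75
```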

For this analysis, we want to avoid false positives or type 1 errors. Put differently, we prefer false negatives to false positives.

A model that meets that criteria, $False\ Negative \succ False\ Positive$, is known as preferring precision over recall, or is a high precision model.


Metrics: Precision

Precision is the ratio of correctly predicted positive classes to all positive class predictions (both correct and incorrect predictions of the positive class):

$$Precision = \frac{\sum True\ Positives}{\sum True\ Positives + \sum False\ Positives}$$

A model which avoids false positives would have a high precision value, or score. It may also be skewed toward false negatives.

Metrics: Recall

Recall, sometimes referred to as a model's sensitivity, is the ratio of correctly predicted positive classes to the actual number of positive classes (true positives and false negatives together make up the actual positives):

$$Recall = \frac{\sum True\ Positives}{\sum Actual\ Positives} = \frac{\sum True\ Positives}{\sum True\ Positives + \sum False\ Negatives}$$

A model which avoids false negatives would have a high recall value, or score. It may also be skewed toward false positives.
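Both formulas can be checked with the same hypothetical confusion-matrix counts (the numbers below are arbitrary):

```python
# Hypothetical confusion-matrix counts for illustration
tp, fp, fn = 30, 10, 20

precision = tp / (tp + fp)  # penalizes false positives
recall = tp / (tp + fn)     # penalizes false negatives

print(precision, recall)  # 0.75 0.6
```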

Metrics: F-$\beta$ Score

An F-$\beta$ Score is a method of scoring a model both on precision and recall.

Where $\beta \in [0,\infty)$:

$$F_{\beta} = \left(1+\beta^{2}\right) \cdot \frac{Precision\ \cdot Recall}{\beta^{2} \cdot Precision + Recall}$$

When $\beta = 0$, we get precision: $$F_{\beta=0} = \left(1+0^{2}\right) \cdot \frac{Precision\ \cdot Recall}{0^{2} \cdot Precision + Recall} = \left(1\right) \cdot \frac{Precision\ \cdot Recall}{Recall} = Precision$$

When $\beta = 1$, we get the harmonic mean of precision and recall:

$$F_{\beta=1} = \left(1+1^{2}\right) \cdot \frac{Precision\ \cdot Recall}{1^{2} \cdot Precision + Recall} = \left(2\right) \cdot \frac{Precision\ \cdot Recall}{Precision + Recall}$$
  • Note: $Harmonic\ Mean = \frac{2xy}{x + y}$

... and as $\beta$ grows large, we get something closer to recall:

$$F_{\beta \rightarrow \infty} = \left(1+\beta^{2}\right) \cdot \frac{Precision\ \cdot Recall}{\beta^{2} \cdot Precision + Recall} = \frac{Precision\ \cdot Recall}{\frac{\beta^{2}}{1+\beta^{2}} \cdot Precision + \frac{1}{1+ \beta^{2}} \cdot Recall}$$

As $\beta \rightarrow \infty$: $$\frac{Precision\ \cdot Recall}{\frac{\beta^{2}}{1+\beta^{2}} \cdot Precision + \frac{1}{1+ \beta^{2}} \cdot Recall} \rightarrow \frac{Precision \cdot Recall}{1 \cdot Precision + 0 \cdot Recall} = \frac{Precision}{Precision} \cdot Recall = Recall$$
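The limiting behavior above can be verified numerically; the precision and recall values below are arbitrary:

```python
def f_beta(precision, recall, beta):
    """F-beta score as defined above."""
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

p, r = 0.75, 0.6  # arbitrary precision and recall

print(f_beta(p, r, beta=0))    # beta = 0 recovers precision (0.75)
print(f_beta(p, r, beta=1))    # harmonic mean: 2*0.75*0.6 / (0.75 + 0.6)
print(f_beta(p, r, beta=100))  # large beta approaches recall (~0.6)
```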

ML: Proposed Naive Bayes (Benchmark)

The Naive Bayes Classifier will be used as a benchmark model for this work.

Bayes' Theorem is as such:

$$P\left(A|B\right) = \frac{P\left(B|A\right) \cdot P\left(A\right)}{P\left(B\right)}$$

It is considered naive as it assumes each feature is independent of one another.

Bayes' Theorem calculates the probability of an outcome (e.g. whether an individual receives income exceeding 50k/yr) based on the joint probability distributions of certain other events (e.g. any factors we include in the model).
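As a quick numeric sketch of Bayes' Theorem, with all probabilities invented for illustration (not taken from the census data):

```python
# Invented probabilities for illustration only
# A = "income > 50k", B = "holds an advanced degree"
p_a = 0.25          # P(A): prior probability of high income
p_b_given_a = 0.40  # P(B|A): degree rate among high earners
p_b = 0.15          # P(B): overall degree rate

p_a_given_b = p_b_given_a * p_a / p_b  # P(A|B) by Bayes' Theorem
print(round(p_a_given_b, 3))  # 0.667
```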

As an example, I propose a model that always predicts an individual makes more than 50k/yr. This model has no false negatives; it has perfect recall (recall = 1).

Note: The purpose of generating a naive predictor is simply to show what a base model without any intelligence would look like. When no benchmark model exists, doing better than random choice is a comparable starting point.

Since this model always predicts a 1:

  • All true positives will be found (1 when 1 is true), equal to the sum of the label
  • False positives for this model are the difference between the number of all observations and those correctly predicted (1 when 0 is true)
  • No true negatives will be found (0 when 0 is true) as no 0s are ever predicted
  • No false negatives are predicted (0 when 1 is true) as no 0s are ever predicted

Note: I set $\beta = \frac{1}{2}$ to weight precision more heavily, since false positives are costly for CharityML. Recall the implications of setting the values of $\beta$ from before.

In [34]:
TP = np.sum(Y)
TN = 0
FP = len(Y) - TP
FN = 0
Beta = 1/2
accuracy = (TP + TN) / len(Y)
recall = TP / (TP + FN)
precision = TP / (TP + FP)
fscore = (1+Beta ** 2) * (precision * recall)/(((Beta ** 2) * precision) + recall)
print("Naive Predictor - Accuracy score: {:.4f}, F-score: {:.4f}".format(accuracy, fscore))
Naive Predictor - Accuracy score: 0.2478, F-score: 0.2917

ML: Logistic Regression

Logistic regression models the probability that a set of independent variables indicates the positive class of a dependent variable. The outcome of logistic regression is bounded between 0 and 1 (i.e. $ h_{\theta}\left(X\right) \in \left[0,1\right]$).

$$ h_{\theta}\left(X\right) = P\left(Y=1 | X\right)= \left\{ \begin{array}{ll} y=1 & \frac{1}{1+e^{-\left(\theta^{T}X\right)}} \\ y=0 & 1 - \frac{1}{1+e^{-\left(\theta^{T}X\right)}} \\ \end{array} \right. $$

A useful property of the sigmoid is that its derivative takes the simple form: $$h'\left(x\right) = h\left(x\right) \cdot \left(1 - h\left(x\right)\right)$$

This identity appears when minimizing the log-loss cost by gradient descent.

Deriving the Sigmoid's Derivative: How does $h'\left(x\right) = h\left(x\right) \cdot \left(1 - h\left(x\right)\right)$ fall out of $h\left(x\right) = \frac{1}{1+e^{-x}}$ ?

The following math involves a knowledge of some single variable differential calculus, $y = x^{n} \rightarrow \frac{\Delta y}{\Delta x} = n\cdot x^{n-1}$, and the chain rule, $\frac{\Delta}{\Delta x}f\left(g\left(x\right)\right)= f'\left(g\left(x\right)\right) \cdot g'\left(x\right)$:

$$h\left(x\right) = \frac{1}{1+e^{-x}}$$

$$\frac{\Delta h\left(x\right)}{\Delta x} = \frac{\Delta}{\Delta x}\left(1+e^{-x}\right)^{-1}$$

$$\because \frac{\Delta}{\Delta x}x^{n} = n\cdot x^{n-1} \wedge \frac{\Delta}{\Delta x}f\left(g\left(x\right)\right)= f'\left(g\left(x\right)\right) \cdot g'\left(x\right) \implies$$

$$\frac{\Delta}{\Delta x}\left(1+e^{-x}\right)^{-1} = -\left(1+e^{-x}\right)^{-2}\cdot\left(-e^{-x}\right) = \frac{e^{-x}}{\left(1+e^{-x}\right)^{2}} = \frac{e^{-x}}{1+e^{-x}} \cdot \frac{1}{1+e^{-x}}$$

$$= \frac{\left(1+e^{-x}\right)-1}{1+e^{-x}} \cdot \frac{1}{1+e^{-x}} = \left(\frac{1+e^{-x}}{1+e^{-x}} - \frac{1}{1+e^{-x}}\right)\cdot \frac{1}{1+e^{-x}}$$

$$= \left(1-h\left(x\right)\right) \cdot h\left(x\right)\ \square$$
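The identity derived above can be sanity-checked numerically with a central finite difference:

```python
import numpy as np

def h(x):
    """Logistic (sigmoid) function."""
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-4.0, 4.0, 9)
eps = 1e-6

numeric = (h(x + eps) - h(x - eps)) / (2 * eps)  # central difference
analytic = h(x) * (1 - h(x))                     # h'(x) = h(x)(1 - h(x))

# Maximum discrepancy should be tiny (finite-difference error only)
print(np.max(np.abs(numeric - analytic)))
```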

ML: Model Application Pipeline

It can be useful to establish a routine for aspects related to modeling. This allows for standard comparison of outcomes generated from the same process.

Training

In [35]:
def train_predict(learner, sample_size, X_train, y_train, X_test, y_test): 
    """
    Pipeline to train, predict, and score algorithms
    
    :param learner: the learning algorithm to be trained and predicted on
    :param sample_size: the size of samples (number) to be drawn from training set
    :param X_train: features training set
    :param y_train: income training set
    :param X_test: features testing set
    :param y_test: income testing set
    
    :return results: training/prediction times, accuracy, and F-0.5 scores (beta = 0.5 chosen for high precision, avoiding false positives)
    """
    results = {}
    
    # Fitting
    start = time()                                               # Get start time
    learner.fit(X_train[:sample_size], y_train[:sample_size])    # Train model
    end = time()                                                 # Get end time
    results['train_time'] = end - start                          # Calculate the training time
        
    # Predicting
    start = time() # Get start time
    predictions_test = learner.predict(X_test)
    predictions_train = learner.predict(X_train[:300])
    end = time() # Get end time
    results['pred_time'] = end - start                           # Calculate the total prediction time
    
    # Scoring
    results['acc_train'] = accuracy_score(y_train[:300], predictions_train)         # Training accuracy
    results['acc_test'] = accuracy_score(y_test, predictions_test)                  # Testing accuracy
    results['f_train'] = fbeta_score(y_train[:300], predictions_train, beta=0.5)    # Training F-0.5 score
    results['f_test'] = fbeta_score(y_test, predictions_test, beta=0.5)             # Testing F-0.5 score
    
    # User feedback
    print("{} trained on {} samples.".format(learner.__class__.__name__, sample_size))
    
    return results

Training Iterations

In [36]:
def trainer(classifer):
    """
    Function to train each selected model in a routine fashion for comparison
    :param classifier: classification model from Scikit-Learn to be trained
    :return step_results: outcome of training on the data and defined parameters
    """
    step_results = {}
    
    samples_100 = int(len(X_train))
    samples_10 = int(len(X_train) / 10)
    samples_1 = int(len(X_train) / 100)
    
    clf_name = classifer.__class__.__name__
    step_results[clf_name] = {}
    
    for i,sample in enumerate([samples_1, samples_10, samples_100]):
        step_results[clf_name][i] = train_predict(classifer, sample, X_train, y_train, X_test, y_test)
    
    return step_results
In [47]:
def grid_tuner(classifier, parameters):
    """
    Function to tune with grid search in a routine fashion
    :param classifier: classification model from Scikit-Learn to be trained
    :return outcomes: accuracy and F-0.5 scores for the initial and tuned models, plus the best parameters found
    """
    scorer = make_scorer(fbeta_score, beta=0.5)
    grid_obj = GridSearchCV(estimator=classifier, param_grid=parameters, scoring=scorer)
    grid_fit = grid_obj.fit(X_train, y_train)
    best_classifier = grid_fit.best_estimator_
    predictions = (classifier.fit(X_train, y_train)).predict(X_test)
    best_predictions = best_classifier.predict(X_test)
    
    outcomes = {'test_acc': accuracy_score(y_test, predictions),
               'f_test': fbeta_score(y_test, predictions, beta = 0.5),
               'tuned_acc': accuracy_score(y_test, best_predictions),
               'f_tuned': fbeta_score(y_test, best_predictions, beta = 0.5),
               'best_param': grid_fit.best_params_}
    
    print("Initial Model:")
    print("\t Accuracy: {:.4f}".format(outcomes['test_acc']))
    print("\t F0.5-Score: {:.4f}".format(outcomes['f_test']))
    print("Tuned Model:")
    print("\t Accuracy: {:.4f}".format(outcomes['tuned_acc']))
    print("\t F0.5-Score: {:.4f}".format(outcomes['f_tuned']))
    print("Best Parameters:")
    print("\t {}".format(outcomes['best_param']))

    return outcomes
In [38]:
log_reg_0 = trainer(classifer=LogisticRegression(random_state=0))
print("Logistic Regression: Default")
pd.DataFrame.from_dict(log_reg_0['LogisticRegression'], orient='index')
LogisticRegression trained on 361 samples.
LogisticRegression trained on 3617 samples.
LogisticRegression trained on 36177 samples.
Logistic Regression: Default
/Users/daiglechris/opt/anaconda3/lib/python3.7/site-packages/sklearn/linear_model/_logistic.py:940: ConvergenceWarning:

lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression

Out[38]:
train_time pred_time acc_train acc_test f_train f_test
0 0.008859 0.005293 0.903333 0.820785 0.833333 0.642790
1 0.038103 0.005444 0.863333 0.839027 0.705128 0.682561
2 0.507330 0.003216 0.873333 0.841902 0.730519 0.691478
In [39]:
log_reg_1 = trainer(classifer=LogisticRegression(penalty='l2', max_iter=500, random_state=0, solver='liblinear'))
print("Logistic Regression: Thoughtful")
pd.DataFrame.from_dict(log_reg_1['LogisticRegression'], orient='index')
LogisticRegression trained on 361 samples.
LogisticRegression trained on 3617 samples.
LogisticRegression trained on 36177 samples.
Logistic Regression: Thoughtful
Out[39]:
train_time pred_time acc_train acc_test f_train f_test
0 0.002572 0.005120 0.896667 0.817800 0.808824 0.634809
1 0.014408 0.005628 0.860000 0.838695 0.696203 0.681541
2 0.216720 0.003606 0.876667 0.841680 0.740132 0.690949
In [ ]:
 
In [40]:
rand_for_0 = trainer(classifer=RandomForestClassifier(random_state=0))
print("Random Forest: Default")
pd.DataFrame.from_dict(rand_for_0['RandomForestClassifier'], orient='index')
RandomForestClassifier trained on 361 samples.
RandomForestClassifier trained on 3617 samples.
RandomForestClassifier trained on 36177 samples.
Random Forest: Default
Out[40]:
train_time pred_time acc_train acc_test f_train f_test
0 0.111178 0.077738 1.000000 0.822001 1.000000 0.644699
1 0.303180 0.122575 0.996667 0.837811 0.988372 0.676572
2 4.077604 0.187464 0.983333 0.841570 0.975610 0.685871
In [41]:
rand_for_1 = trainer(classifer=RandomForestClassifier(n_estimators=500, min_samples_leaf=25,random_state=0))
print("Random Forest: Tuned")
pd.DataFrame.from_dict(rand_for_1['RandomForestClassifier'], orient='index')
RandomForestClassifier trained on 361 samples.
RandomForestClassifier trained on 3617 samples.
RandomForestClassifier trained on 36177 samples.
Random Forest: Tuned
Out[41]:
train_time pred_time acc_train acc_test f_train f_test
0 0.485081 0.216865 0.773333 0.752128 0.000000 0.000000
1 0.879337 0.327733 0.873333 0.840022 0.753968 0.699949
2 10.417206 0.425449 0.883333 0.854063 0.772059 0.727489
In [42]:
abc_0 = trainer(classifer=AdaBoostClassifier(random_state=0))
print("Ada Boost Classifier: Default")
pd.DataFrame.from_dict(abc_0['AdaBoostClassifier'], orient='index')
AdaBoostClassifier trained on 361 samples.
AdaBoostClassifier trained on 3617 samples.
AdaBoostClassifier trained on 36177 samples.
Ada Boost Classifier: Default
Out[42]:
train_time pred_time acc_train acc_test f_train f_test
0 0.059702 0.127341 0.933333 0.827418 0.870253 0.656731
1 0.166007 0.121281 0.890000 0.847319 0.765625 0.700140
2 1.571490 0.115430 0.886667 0.860918 0.774648 0.738187
In [48]:
%%time
parameters = {'n_estimators': [200, 400],
              "learning_rate": [1, 1.5]}
abc_tune_0 = grid_tuner(classifier=AdaBoostClassifier(random_state=0), parameters=parameters)
Initial Model:
	 Accuracy: 0.8609
	 F0.5-Score: 0.7382
Tuned Model:
	 Accuracy: 0.8700
	 F0.5-Score: 0.7568
Best Parameters:
	 {'learning_rate': 1.5, 'n_estimators': 400}
CPU times: user 2min 34s, sys: 2.72 s, total: 2min 37s
Wall time: 2min 37s
In [51]:
# Initialize the three models
clf_A = LogisticRegression(penalty='l2', max_iter=500, solver='liblinear', random_state=0)
clf_B = AdaBoostClassifier(n_estimators=400, learning_rate=1.5, random_state=0)
clf_C = RandomForestClassifier(n_estimators=500, min_samples_leaf=25,random_state=0)

# Calculate the number of samples for 1%, 10%, and 100% of the training data
samples_100 = int(len(X_train))
samples_10 = int(len(X_train) / 10)
samples_1 = int(len(X_train) / 100)

# Collect results on the learners
results = {}
for clf in [clf_A, clf_B, clf_C]:
    clf_name = clf.__class__.__name__
    results[clf_name] = {}
    for i, samples in enumerate([samples_1, samples_10, samples_100]):
        results[clf_name][i] = \
        train_predict(clf, samples, X_train, y_train, X_test, y_test)
LogisticRegression trained on 361 samples.
LogisticRegression trained on 3617 samples.
LogisticRegression trained on 36177 samples.
AdaBoostClassifier trained on 361 samples.
AdaBoostClassifier trained on 3617 samples.
AdaBoostClassifier trained on 36177 samples.
RandomForestClassifier trained on 361 samples.
RandomForestClassifier trained on 3617 samples.
RandomForestClassifier trained on 36177 samples.
In [52]:
# Run metrics visualization for the three supervised learning models chosen
sns.set()
vs.evaluate(results, accuracy, fscore)
/Users/daiglechris/Git/portfolio/machine_learning/supervised_learning/ACCEPTEDDaigleFindingDonorsResubmit/visuals.py:118: UserWarning:

Tight layout not applied. tight_layout cannot make axes width small enough to accommodate all axes decorations

In [53]:
# # Full Page - Code
!jupyter nbconvert WIP_Donor_Classification.ipynb --output WIP_Class_Code --reveal-prefix=reveal.js --SlidesExporter.reveal_theme=serif --SlidesExporter.reveal_scroll=True --SlidesExporter.reveal_transition=none
# # Full Page - No Code
!jupyter nbconvert WIP_Donor_Classification.ipynb --output WIP_Class_No_Code --reveal-prefix=reveal.js --SlidesExporter.reveal_theme=serif --SlidesExporter.reveal_scroll=True --SlidesExporter.reveal_transition=none --TemplateExporter.exclude_input=True
# # Slides - No Code
!jupyter nbconvert --to slides WIP_Donor_Classification.ipynb --output WIP_Class_Slides --TemplateExporter.exclude_input=True --SlidesExporter.reveal_transition=none --SlidesExporter.reveal_scroll=True
[NbConvertApp] Converting notebook WIP_Donor_Classification.ipynb to html
[NbConvertApp] Writing 13583986 bytes to WIP_Class_Code.html
[NbConvertApp] Converting notebook WIP_Donor_Classification.ipynb to html
[NbConvertApp] Writing 13460507 bytes to WIP_Class_No_Code.html
[NbConvertApp] Converting notebook WIP_Donor_Classification.ipynb to slides
[NbConvertApp] Writing 13467032 bytes to WIP_Class_Slides.slides.html